Build a Traffic Sign Recognition Project
Here are the summary statistics of the traffic sign datasets:
Each dataset has a different frequency distribution of the traffic signs, as follows.
In the training dataset, some classes have only about two hundred images, which may be too few for adequate training.
There is also nearly a tenfold difference in sample counts among the 43 classes, which can lead to unbalanced training and low-quality inference.
The following shows each class's mean value of its images' pixel mean values.
Some classes, such as 6, 20, 10 and 8, have very dark images, which may indicate a low-contrast dataset.
The following is each class's standard deviation (stdev) of its images' pixel mean values.
Some classes, such as 6, 21, 27 and others, have very low deviation compared to the other classes, which can also cause unbalanced training.
As explained above, the training dataset may have some issues for training, such as:
1. a shortage of samples in some labels
2. low-contrast (dark) images
3. low variance in some classes
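The per-class statistics behind the three issues above can be gathered with NumPy. The snippet below is a sketch that uses a small synthetic stand-in for the real `X_train` / `y_train` arrays (the variable names and sizes are assumptions, not the project's actual data).

```python
import numpy as np

# Miniature stand-in for the real (X_train, y_train): 32x32x3 uint8 images
# and integer class labels 0..42.
rng = np.random.default_rng(0)
y_train = rng.integers(0, 43, size=1000)
X_train = rng.integers(0, 256, size=(1000, 32, 32, 3), dtype=np.uint8)

# 1. per-class sample counts (the frequency distribution)
counts = np.bincount(y_train, minlength=43)

# 2. mean of each image's pixel mean, grouped by class (brightness indicator)
img_means = X_train.reshape(len(X_train), -1).mean(axis=1)
class_mean = np.array([img_means[y_train == c].mean() for c in range(43)])

# 3. stdev of the per-image pixel means within each class (diversity indicator)
class_stdev = np.array([img_means[y_train == c].std() for c in range(43)])

print(counts.min(), counts.max())   # spot classes with few samples
print(class_mean.argmin())          # darkest class on average
print(class_stdev.argmin())         # least varied class
```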
Here is a quick look at images sampled from the training dataset.
The dataset has a lot of similar images that seem to have been augmented via image-processing techniques such as changing brightness, contrast, chroma and cropping position.
Here are the typical class images (the first image of each class in the training dataset) and the class-averaged images.
All class-averaged images still keep enough of their own characteristics to be recognized as traffic signs.
But some classes seem to have troubles, as follows.
As described above, the training dataset potentially has trouble factors.
So I conducted a feasibility study before selecting the methods for pre-processing, CNN design and image-data augmentation, in order to reduce the training-data risk.
For the feasibility study, I made a reasonably scaled model.
This model is bigger than LeNet-5 (lesson 8) and smaller than the final model, so I named it "middle model".
Here is the specification of "middle model" and its training parameters.
| Layer | Description |
|---|---|
| Input | 32x32x3 RGB/Gray image |
| Convolution 5x5 | 1x1 stride, VALID padding, outputs 28x28x16 |
| Batch Normalization | |
| RELU | |
| Max pooling | 2x2 stride, VALID padding, outputs 14x14x16 |
| Convolution 5x5 | 1x1 stride, VALID padding, outputs 10x10x48 |
| RELU | |
| Max pooling | 2x2 stride, VALID padding, outputs 5x5x48 |
| flatten | 5x5x48 => 1200 |
| Fully connected | outputs 100 |
| RELU | |
| Dropout | keep prob. 0.5 |
| Fully connected | outputs 100 |
| RELU | |
| Dropout | keep prob. 0.5 |
| Softmax | outputs 43 (class number) |
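The output sizes in the table can be double-checked with the VALID-padding size formula, out = floor((in - kernel) / stride) + 1; the helper names below are my own.

```python
def conv_out(size, kernel, stride=1):
    # VALID padding: out = floor((size - kernel) / stride) + 1
    return (size - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    return (size - kernel) // stride + 1

s = conv_out(32, 5)   # conv1 5x5, stride 1 -> 28
s = pool_out(s)       # 2x2 max pool       -> 14
s = conv_out(s, 5)    # conv2 5x5, stride 1 -> 10
s = pool_out(s)       # 2x2 max pool       -> 5
print(s, s * s * 48)  # 5 1200: matches the flatten row (5x5x48 => 1200)
```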
| Title | Description |
|---|---|
| Optimizer | Adam |
| learning_rate | 0.0002 |
| batch size | 100 |
| EPOCH Number | 200 |
The following figure shows the pixel mean and stdev distribution of each image in the training dataset.
It shows the dataset is not normalized yet.
To make training work better, the following normalization types are possible.
The following figures show the distributions for each normalization type.
The normalization types can control the spread of the distribution as below.
The following images are the averaged class images for each normalization type.
The averaged images express part of the effect of the normalization. Compared to the averaged images without normalization, the dark-brightness issue is mitigated by the type 1 and type 2 normalization methods. But the low-chroma and background-texture issues still remain in the normalized images.
To check the potential of the middle model, I examined 7 types of input data as follows.
| No | Title | image type | Normalization type |
|---|---|---|---|
| 0 | RGB | RGB-3ch | Not normalized |
| 1 | RGB-Type0 | RGB-3ch | normalized over all pixels in the training data |
| 2 | RGB-Type1 | RGB-3ch | normalized over each image's pixels |
| 3 | RGB-Type2 | RGB-3ch | normalized over each RGB plane of each image |
| 4 | Gray | Gray | Not normalized |
| 5 | Gray-Type0 | Gray | normalized over all pixels in the training data |
| 6 | Gray-Type1 | Gray | normalized over each image's pixels |
Normalization is executed by the following equation.

```
normalized_image = (org_image - mean) / (2.0 * stdev)
```
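Applied over the three scopes above (all training pixels, each image, each image plane), the equation could be implemented roughly as follows; this is a sketch that assumes the images are held in an N x 32 x 32 x 3 float array, and the function names are mine.

```python
import numpy as np

def normalize_type0(images, mean, stdev):
    """Type 0: one (mean, stdev) pair computed over ALL training pixels."""
    return (images - mean) / (2.0 * stdev)

def normalize_type1(images):
    """Type 1: mean/stdev computed over each image's own pixels."""
    m = images.mean(axis=(1, 2, 3), keepdims=True)
    s = images.std(axis=(1, 2, 3), keepdims=True)
    return (images - m) / (2.0 * s)

def normalize_type2(images):
    """Type 2: mean/stdev computed per image AND per RGB plane."""
    m = images.mean(axis=(1, 2), keepdims=True)
    s = images.std(axis=(1, 2), keepdims=True)
    return (images - m) / (2.0 * s)

rng = np.random.default_rng(1)
X = rng.random((4, 32, 32, 3)).astype(np.float32)
X1 = normalize_type1(X)
print(X1.shape)                         # (4, 32, 32, 3)
print(abs(float(X1[0].mean())) < 1e-3)  # each image now has near-zero mean
```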
The following figures show the accuracy curves and the final accuracies for the 3 datasets.
After 200 epochs, every type obtained 93% accuracy on the validation dataset and seems able to reach higher accuracy, as shown below.
Gray-scale inputs achieved higher validation accuracy, but lower test accuracy, than RGB inputs.
I chose RGB-Type1 as the input format for the study hereafter, though the feasibility study shows that gray-scale input gets better validation accuracy than RGB input.
All 7 input format types, including RGB, would satisfy the 93% accuracy goal of the project.
So I decided to challenge something that can solve the low-chroma and background-texture issues above.
The RGB input may also be useful for checking which modifications affect the training issues.
Using "middle model" with RGB input images after type-1 normalization, I obtained the failure frequencies shown below.
This means there may be new troubles other than the dataset issues described above.
Compared to the number of training samples, the classes with many failures don't seem to have enough training data, as shown below.
All of the failure images are very low-chroma, and the training dataset for the class doesn't have such images.
About half of the failure images have very low resolution, such that most humans might also misread them.
But for the rest of the failures, I cannot identify the factor causing the mis-inferences.
Almost all of the failure images are so dark that I cannot recognize them without some kind of image enhancement.
All of the failure images contain only a small traffic sign in the frame.
All of the failure images are very dark and low-contrast, but I cannot identify the difference from the images that were successfully inferred.
All of the failure images have high-contrast backgrounds. The normalization method may not work well on such images.
I tried to enlarge the filter tap size of the first convolutional layer, because the quick looks above showed that "middle model" may not be expressive enough for the characteristics of each class.
The following figure shows the accuracy curves of 4 model architectures for each epoch.
"5x5" or "7x7" means the CNN's tap size, and "0bn" or "1bn" means the usage of batch normalization ("0bn" is the no-batch-normalization model).
The no-batch-normalization models reached near their peak accuracy at about epoch 500.
The batch-normalization models had a lower accuracy level, at least before epoch 1000, though they may reach higher accuracy beyond 1000 epochs.
It might be better for the batch-normalization models to use a higher learning rate than the no-batch-normalization models.
Here, to compare under equal conditions, all 4 models use 0.0002 as the learning rate.
Based on the feasibility study, I chose the final model as below. I call the final model architecture "large model".
The unit numbers were set to adequate values by watching the varying histograms on TensorBoard. (It's a fantastic tool!)
The CNN filter sizes of 64 / 84 and the FC unit size of 240 are moderate values that give smooth histograms of their weights.
The final model has two dropout layers to prevent overfitting.
| Layer | Description |
|---|---|
| Input | 32x32x3 RGB image |
| Convolution 5x5 | 1x1 stride, VALID padding, outputs 28x28x64 |
| RELU | |
| Max pooling | 2x2 stride, VALID padding, outputs 14x14x64 |
| Convolution 5x5 | 1x1 stride, VALID padding, outputs 10x10x84 |
| RELU | |
| Max pooling | 2x2 stride, VALID padding, outputs 5x5x84 |
| flatten | 5x5x84 => 2100 |
| Fully connected | outputs 240 |
| RELU | |
| Dropout | keep prob. 0.5 |
| Fully connected | outputs 240 |
| RELU | |
| Dropout | keep prob. 0.5 |
| Softmax | outputs 43 (class number) |
The training hyperparameters are the same as for "middle model".
They are also chosen for slow training, to prevent overfitting.
| Title | Description |
|---|---|
| Optimizer | Adam |
| learning_rate | 0.0002 |
| batch size | 100 |
| EPOCH Number | 1000 |
The input images kept their color planes and were pre-processed via the type-1 normalization described above.
This method is not the best way to get the highest accuracy, but it is valuable for studying how to strengthen the model architecture.
After training on the Jupyter notebook, I got the result below after 307 epochs, though "large model" can reach over 0.98 validation accuracy with more epochs.
The following shows the accuracy curves for the datasets.
I newly got 5 traffic sign images from the web and analyzed the inferences made by "large model".
At first, I collected 12 new images found with the search keywords "german traffic sign" and the license-free option,
then selected 5 images from the points of view listed below.
| No | input image | image size | view point |
|---|---|---|---|
| 0 | | 105 x 106, 96dpi | new background textures |
| 1 | | 299 x 168, 72dpi | a slanted sign board |
| 2 | | 188 x 141, 72dpi | extra textures on the sign board |
| 3 | | 369 x 349, 96dpi | no background texture, but uniformly blue |
| 4 | | 259 x 194, 72dpi | extra textures on the sign board |
The following table shows the inference results for the 5 images via "large model".
In spite of the difficulties in the images, 4 images were correctly inferred and the second-ranked probabilities were very low.
The No.2 image, which has a scissors illustration, resulted in an error.
No.5 is an example image that is correctly inferred as class "17: No entry", included for comparison with the No.2 image.
| No | score | input image | answer | inference |
|---|---|---|---|---|
| 0 | O | | 4 | 4 : Speed limit (70km/h) |
| 1 | O | | 13 | 13 : Yield |
| 2 | X | | 17 | 3 : Speed limit (60km/h) |
| 3 | O | | 33 | 33 : Turn right ahead |
| 4 | O | | 40 | 40 : Roundabout mandatory |
| 5 | O | | 17 | 17 : No entry |
The following text shows the softmax probabilities for each input image.
For every image other than No.2, "large model" inferred the answer exactly.
The No.2 image was completely confused with class 3 "Speed limit (60km/h)"; the second probability was only 0.01%, though that second inference correctly showed class 17.
The No.5 image was correctly inferred, but the second to fourth probabilities all showed Speed Limit sign boards.
This means class 17 potentially has characteristics similar to the Speed Limit signs.
No.0:
answer: 4: Speed limit (70km/h)
Top1:100.00% class 4: Speed limit (70km/h)
Top2: 0.00% class 0: Speed limit (20km/h)
Top3: 0.00% class 1: Speed limit (30km/h)
Top4: 0.00% class 2: Speed limit (50km/h)
Top5: 0.00% class 3: Speed limit (60km/h)
No.1:
answer: 13: Yield
Top1:100.00% class 13: Yield
Top2: 0.00% class 38: Keep right
Top3: 0.00% class 0: Speed limit (20km/h)
Top4: 0.00% class 1: Speed limit (30km/h)
Top5: 0.00% class 2: Speed limit (50km/h)
No.2:
answer: 17: No entry
Top1: 99.99% class 3: Speed limit (60km/h)
Top2: 0.01% class 17: No entry
Top3: 0.00% class 9: No passing
Top4: 0.00% class 14: Stop
Top5: 0.00% class 32: End of all speed and passing limits
No.3:
answer: 33: Turn right ahead
Top1:100.00% class 33: Turn right ahead
Top2: 0.00% class 25: Road work
Top3: 0.00% class 0: Speed limit (20km/h)
Top4: 0.00% class 1: Speed limit (30km/h)
Top5: 0.00% class 2: Speed limit (50km/h)
No.4:
answer: 40: Roundabout mandatory
Top1: 99.97% class 40: Roundabout mandatory
Top2: 0.02% class 11: Right-of-way at the next intersection
Top3: 0.00% class 18: General caution
Top4: 0.00% class 16: Vehicles over 3.5 metric tons prohibited
Top5: 0.00% class 37: Go straight or left
No.5:
answer: 17: No entry
Top1:100.00% class 17: No entry
Top2: 0.00% class 0: Speed limit (20km/h)
Top3: 0.00% class 1: Speed limit (30km/h)
Top4: 0.00% class 2: Speed limit (50km/h)
Top5: 0.00% class 3: Speed limit (60km/h)
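Listings like the ones above can be produced from the model's softmax output (inside TensorFlow this is what `tf.nn.top_k` is for). A plain NumPy sketch, with a made-up probability vector and an abbreviated name table:

```python
import numpy as np

# Excerpt of the class-id -> name table; the full project uses signnames.csv.
SIGN_NAMES = {3: 'Speed limit (60km/h)', 9: 'No passing', 14: 'Stop',
              17: 'No entry', 32: 'End of all speed and passing limits'}

def top5(probs):
    """Return (class, probability) pairs for the 5 largest softmax outputs."""
    order = np.argsort(probs)[::-1][:5]
    return [(int(c), float(probs[c])) for c in order]

# made-up softmax vector resembling the No.2 result above
probs = np.zeros(43)
probs[3], probs[17] = 0.9999, 0.0001
for rank, (cls, p) in enumerate(top5(probs), start=1):
    name = SIGN_NAMES.get(cls, '(class %d)' % cls)
    print('Top%d:%7.2f%% class %2d: %s' % (rank, p * 100.0, cls, name))
```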
As described above, the original training dataset has shortages in terms of quality and quantity for some classes.
Here is a summary of the subjective issues in the training dataset.
Here is a summary of the factors behind the mis-inferred validation data.
I can take the following augmentation plans to resolve them.
At first, I tried augmenting class 16, which has the most serious troubles in its dataset.
For class 16, most of the mis-inferences occurred on low-chroma images, and the training data didn't contain such images.
Therefore, augmenting with low-chroma images should be effective in improving the accuracy of that class.
The following code is the specific augmentation method.
The code duplicates each matching image and lowers its chroma (saturation) by multiplying it by 0.4. It also modifies the hue and brightness (intensity) to imitate the mis-inferred images.
```python
extra_num = 0
for ans, org in zip(y_train, X_train):
    if 16 == ans:
        # work on a float copy scaled to [0, 1]
        img = org.astype(np.float32) / 255.0
        Vnoise = np.random.randn(32, 32) * 0.01
        hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
        hsv[:, :, 0] = (hsv[:, :, 0] + 30) % 360   # shift hue (H is 0..360 for float images)
        hsv[:, :, 1] = hsv[:, :, 1] * 0.4          # lower the saturation (chroma)
        hsv[:, :, 1] = hsv[:, :, 1].clip(0.05, 0.95)
        hsv[:, :, 2] = hsv[:, :, 2] + Vnoise + 0.1 # brighten slightly and add noise
        hsv[:, :, 2] = hsv[:, :, 2].clip(0.05, 0.95)
        img = cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
        # np.append flattens its arguments; the array is reshaped after the loop
        X_train = np.append(X_train, img)
        y_train = np.append(y_train, ans)
        extra_num += 1
```
The following pictures are two pairs of an original image and an augmented image that the code created from the class 16 training data.
The number of augmented images is 360, which increases the training dataset from 34799 to 35159.
The increase rate is 1.035%.
The following graph shows how the augmentation improves the validation accuracy, as the line "aug c16".
The training hyperparameters are exactly the same as for "large model", including the random seed.
The 1.035% augmentation improves the whole validation accuracy by around 0.25% on average within 500 epochs.
TODO: replace the figure
Having confirmed the effect of the augmentation for class 16,
I then augmented the dataset for classes 21, 40 and 24 in similar ways.
For class 21, noisy validation data were mis-inferred, so augmenting the training dataset with noisy images should be effective.
For classes 40 and 24, dark and low-contrast images were mis-inferred, so augmenting with such images should be effective.
The following code is the specific augmentation method.
For class 16 it is the same as the code above; for class 21 it adds brightness noise, and for classes 40 and 24 it darkens the images to imitate the mis-inferred ones.
```python
extra_num = 0
for ans, org in zip(y_train, X_train):
    if 16 == ans:
        # class 16: lower chroma, shift hue, brighten slightly
        img = org.astype(np.float32) / 255.0
        Vnoise = np.random.randn(32, 32) * 0.01
        hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
        hsv[:, :, 0] = (hsv[:, :, 0] + 30) % 360
        hsv[:, :, 1] = hsv[:, :, 1] * 0.4
        hsv[:, :, 1] = hsv[:, :, 1].clip(0.05, 0.95)
        hsv[:, :, 2] = hsv[:, :, 2] + Vnoise + 0.1
        hsv[:, :, 2] = hsv[:, :, 2].clip(0.05, 0.95)
        img = cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
        X_train = np.append(X_train, img)
        y_train = np.append(y_train, ans)
        extra_num += 1
    elif 21 == ans:
        # class 21: add stronger brightness noise to imitate noisy images
        img = org.astype(np.float32) / 255.0
        Vnoise = np.random.randn(32, 32) * 0.08
        hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
        hsv[:, :, 2] = hsv[:, :, 2] + Vnoise
        hsv[:, :, 2] = hsv[:, :, 2].clip(0.05, 0.95)
        img = cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
        X_train = np.append(X_train, img)
        y_train = np.append(y_train, ans)
        extra_num += 1
    elif 40 == ans or 24 == ans:
        # classes 40 and 24: darken to imitate dark, low-contrast images
        img = org.astype(np.float32) / 255.0
        hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
        hsv[:, :, 2] = hsv[:, :, 2] * 0.2
        hsv[:, :, 2] = hsv[:, :, 2].clip(0.05, 0.95)
        img = cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
        X_train = np.append(X_train, img)
        y_train = np.append(y_train, ans)
        extra_num += 1
print('X_train augmented')
# np.append flattened X_train, so restore the (N, 32, 32, 3) shape
X_train = X_train.reshape(n_train + extra_num, 32, 32, 3)
n_train = len(X_train)
print("Number of training examples =", n_train)
```
The following pictures are three pairs of an original image and an augmented image that the code created from the class training data.
The number of augmented images is 1170, which increases the training dataset from 34799 to 35969.
The increase rate is 3.362%.
The 3.362% augmentation improves the whole validation accuracy by around 0.5% on average within 500 epochs.
The following graph shows how the augmentation improves the validation accuracy.
The training hyperparameters are exactly the same as for "large model", including the random seed.
The following graph is a histogram showing the frequency of the mis-inferred validation data.
Classes 16 and 21 were considerably improved in inference, but classes 24 and 40 were not.
This shows the augmentation method for classes 24 and 40 was unsuitable.
After 10000 epochs, the validation accuracy stabilized between 0.975 and 0.98 (40-epoch average), and the peak accuracy was 0.98481, while the final model without the augmentation had a 0.98209 peak accuracy.
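The 40-epoch average used above is a simple moving average over the accuracy curve; a sketch on a synthetic curve (the real values live in the training log):

```python
import numpy as np

def moving_average(values, window=40):
    """Trailing moving average used to smooth a noisy accuracy curve."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')

# synthetic noisy curve rising toward ~0.978, standing in for the real log
rng = np.random.default_rng(2)
epochs = np.arange(1000)
acc = 0.978 - 0.05 * np.exp(-epochs / 100.0) + rng.normal(0.0, 0.002, 1000)
smooth = moving_average(acc, window=40)
print(len(smooth))                # 961 = 1000 - 40 + 1
print(smooth.max() <= acc.max())  # smoothing never exceeds the raw peak
```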
Using the outputFeatureMap function provided by Udacity, I visualized the inside of the "large model" architecture.
The following pictures show the feature maps at conv1 of "large model" when the 5 images were given as input.
Most conv1 feature maps look like high-pass filters that enhance the edges of the traffic signs.
Some feature maps activate on background textures; this may be caused by unsatisfactory training data regarding backgrounds.
Contrary to my expectation, I cannot find feature maps that clearly activate on the hue or color information in the traffic signs.
The information of whether a sign is red or blue is very important for humans, but apparently not for the CNN model. To improve training on color images, I seem to need additional work on the project.
I expected the second layer's feature maps to lose most of the pixel-position information.
But a few conv2 feature maps kept the shape of the sign board.
That may suggest that enlarging conv1 and adding a third convolutional layer would be effective for improving the accuracy.
I got over 98% accuracy and satisfied the requirement of the project via the following methods.
I also made some additional trials on the final model, as follows.
Neither of those challenges contributed to improving the peak accuracy.
Augmenting the training data to address the defects in the dataset was quite effective in improving the accuracy.
Lastly, with the visualizations of the inside of the CNN, I could draw some conclusions about my CNN model.